[DataProcessor] Refactor and unify text/multimodal processor pipeline #7747
luukunn wants to merge 17 commits into develop
Conversation
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview
2 Task status summary
2.1 Required tasks: 2/10 passed
2.2 Optional tasks: 24/28 passed
3 Failure details (required only)
Approval: code style (confidence: high)
Root cause details:
Key logs:
Suggested fix:
Suggested fix summary: contact xyxinyang or zyyzghb to approve the logging change. Link: view logs
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@           Coverage Diff            @@
##           develop    #7747   +/-   ##
==========================================
  Coverage         ?   63.58%
==========================================
  Files            ?      472
  Lines            ?    65920
  Branches         ?    10129
==========================================
  Hits             ?    41915
  Misses           ?    21150
  Partials         ?     2855
```

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
```diff
 if content is None:
-    parsed_content = []
+    parsed_content = content
 elif isinstance(content, str):
-    parsed_content = [{"type": "text", "text": content}]
+    parsed_content = content
 else:
     parsed_content = [parse_content_part(mm_parser, part) for part in content]
```
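For reference, a minimal standalone sketch of the new branch's behavior, where `parse_part` is a hypothetical stand-in for the PR's `parse_content_part(mm_parser, part)`:

```python
from typing import Any, Optional, Union

def parse_part(part: dict) -> dict:
    # Hypothetical stand-in for the PR's parse_content_part helper.
    return part

def normalize_content(content: Optional[Union[str, list]]) -> Any:
    # Mirrors the new branch: None and plain strings pass through unchanged;
    # only lists of content parts are parsed element by element.
    if content is None or isinstance(content, str):
        return content
    return [parse_part(part) for part in content]

print(normalize_content("hi"))                              # 'hi'
print(normalize_content([{"type": "text", "text": "hi"}]))  # [{'type': 'text', 'text': 'hi'}]
```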
```python
if isinstance(self.tokenizer, (LlamaTokenizer, Llama3Tokenizer)) and not self.tokenizer.pad_token_id:
    return self.tokenizer.eos_token
return self.tokenizer.pad_token_id
```
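The hunk falls back to the EOS token when a Llama-family tokenizer lacks a pad token. A common equivalent pattern with Hugging Face tokenizers looks like this (illustrative sketch; the model id is a hypothetical placeholder):

```python
from transformers import AutoTokenizer

# Hypothetical model id, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("some-org/llama-style-model")

# Llama-style tokenizers often ship without a pad token; a common
# fallback is to reuse EOS for padding.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token
```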
```python
if prompt_token_ids[0] > self.tokenizer.vocab_size:
    if not add_prefix_space:
        log_request(
            level=1,
            message="bad_words: '{prompt}' token id {token_id} > vocab_size, skipping",
            prompt=prompt,
            token_id=prompt_token_ids[0],
        )
    continue
if prompt_token_ids not in token_ids:
    token_ids.extend(prompt_token_ids)
```
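For clarity, a simplified standalone illustration of the guard (hypothetical values; not the PR's exact loop, which also handles logging and the prefix-space case):

```python
# Token ids above the tokenizer vocabulary size are dropped from the
# bad-words list rather than passed on to the sampler.
vocab_size = 32000
candidates = [[31999], [32001], [15]]  # per-phrase token ids from the tokenizer

token_ids: list[int] = []
for ids in candidates:
    if ids[0] > vocab_size:
        continue  # id exceeds the vocab, e.g. a tokenizer-added special token
    token_ids.extend(ids)

print(token_ids)  # [31999, 15]
```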
```python
max_tokens = max_model_len - len(request["prompt_token_ids"])
if request.get("max_tokens") is None:
    request["max_tokens"] = max(1, max_tokens)
else:
    request["max_tokens"] = min(max_tokens, request["max_tokens"])

# Default reasoning_max_tokens (only for models that need it, e.g. Ernie)
if self.set_default_reasoning_max_tokens and request.get("reasoning_max_tokens") is None:
```
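A worked illustration of the clamping above (hypothetical numbers; the helper name is mine): with `max_model_len = 4096` and a 1000-token prompt, the budget is 3096, so an unset `max_tokens` defaults to 3096 and a user-supplied 8000 is clamped down to 3096.

```python
def clamp_max_tokens(request: dict, max_model_len: int) -> dict:
    # Same arithmetic as the hunk above, extracted for illustration.
    budget = max_model_len - len(request["prompt_token_ids"])
    if request.get("max_tokens") is None:
        request["max_tokens"] = max(1, budget)  # default: use the full budget
    else:
        request["max_tokens"] = min(budget, request["max_tokens"])  # never exceed it
    return request

req = {"prompt_token_ids": list(range(1000)), "max_tokens": 8000}
print(clamp_max_tokens(req, max_model_len=4096)["max_tokens"])  # 3096
```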
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 23:57:40
📋 Review Summary
PR overview: refactors and unifies the request preprocessing pipeline for text and multimodal models, introducing a unified Processor framework and adding several VL multimodal processors.
Scope of changes: fastdeploy/input/ (processor, multimodal/), fastdeploy/entrypoints/ (llm.py, chat_utils.py)
Impact tags: [DataProcessor] [APIServer]
📝 PR Convention Check
The title [DataProcessor] Refactor and unify text/multimodal processor pipeline follows the convention, and [DataProcessor] is an official tag; the description contains all required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) and is complete. ✓
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/input/processor.py:557 | `assert` used to validate runtime user input; skipped under `python -O` |
| 🟡 Suggestion | fastdeploy/input/multimodal/image_processors/ernie.py:148 | `set_pixels` validates its arguments with `assert`, which is disabled under `python -O` |
| 🟡 Suggestion | fastdeploy/input/multimodal/image_processors/ernie.py:153 | Same as above, for the `max_pixels` argument check |
| ❓ Question | fastdeploy/entrypoints/chat_utils.py:203 | `None` content changed from the old `[]` to `None`; the semantic change needs a compatibility check |
| 🟡 Suggestion | — | The legacy processor files (base_processor.py, text_processor.py, multimodal_processor.py) still exist and tests still reference them; an explicit cleanup plan is recommended |
Overall Assessment
The refactoring approach is clear, the new multimodal processor framework is well structured, and test coverage is fairly thorough. The `assert`-based runtime argument checks need fixing, and a cleanup plan for the legacy processor files should be stated explicitly.
```python
if not request.get("prompt_token_ids"):
    if request.get("prompt"):
        prompt = request.get("prompt")
        assert isinstance(prompt, str) or (
```
🟡 Suggestion: this `assert` validates runtime user input; under `python -O` it is skipped, so downstream logic receives an invalid type and fails with errors that are hard to trace back. Replace it with an explicit `raise ValueError`:

```python
if not (isinstance(prompt, str) or (isinstance(prompt, list) and all(isinstance(t, int) for t in prompt))):
    raise ValueError(f"prompt must be a string or a list of integers, but got {type(prompt)}")
```
```python
def set_pixels(self, min_pixels=None, max_pixels=None, msg=""):
    if min_pixels is not None:
        assert isinstance(min_pixels, int) and min_pixels >= 0, "min_pixels must be positive int"
```
🟡 Suggestion: this `assert` is used for argument validation and is silently skipped under `python -O`. Replace it with a `raise ValueError`:

```python
if not (isinstance(min_pixels, int) and min_pixels >= 0):
    raise ValueError("min_pixels must be a non-negative int")
```

```python
        self.min_pixels = min_pixels
        self.size["min_pixels"] = int(min_pixels)
    if max_pixels is not None:
        assert isinstance(max_pixels, int) and max_pixels > 0, "max_pixels must be positive int"
```
🟡 Suggestion: same as line 148; the `max_pixels` assert is a no-op under `python -O`. Suggested replacement:

```python
if not (isinstance(max_pixels, int) and max_pixels > 0):
    raise ValueError("max_pixels must be a positive int")
```

```diff
 parsed_content = []
 if content is None:
-    parsed_content = []
+    parsed_content = content
```
❓ Question: in the old code, `content is None` yielded `parsed_content = []` (an empty list); the new code keeps `None`. Most tokenizers' `apply_chat_template` may treat `None` and `[]` differently (e.g. `{% if content %}` vs `{% for part in content %}` in a Jinja template). Please confirm that the chat templates of all supported models handle messages with `content=None` correctly.
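To make the concern concrete, a small Jinja2 experiment (illustrative only, not taken from any model's actual chat template):

```python
from jinja2 import Template

if_tpl = Template("{% if content %}has parts{% else %}empty{% endif %}")
for_tpl = Template("{% for part in content %}{{ part }}{% endfor %}")

print(if_tpl.render(content=None))  # "empty"
print(if_tpl.render(content=[]))    # "empty": identical under {% if %}
print(for_tpl.render(content=[]))   # "": an empty loop is fine
for_tpl.render(content=None)        # TypeError: 'NoneType' object is not iterable
```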
Motivation
This PR merges and refactors FastDeploy's Processor system, unifying the request preprocessing pipeline for text and multimodal models and improving the maintainability and extensibility of the code.
With this change, the multimodal input handling logic is split into modules and plugged into the unified Processor framework, providing a clearer implementation base for supporting and maintaining Qwen2.5-VL, Qwen3-VL, ERNIE 4.5 VL, PaddleOCR-VL, and similar models.

Modifications

- Added a unified Processor implementation (fastdeploy/input/processor.py) that consolidates the existing text processing flow.
- Updated the processor creation logic in fastdeploy/input/preprocess.py to build the unified Processor.
- Added fastdeploy/input/multimodal/, including:
  - the MMProcessor abstract base class
  - QwenVLProcessor
  - Qwen3VLProcessor
  - Ernie4_5VLProcessor
  - PaddleOCRVLProcessor
- Added fastdeploy/input/multimodal/common.py, which centralizes shared logic such as image resizing and pixel-range checks.
- Added fastdeploy/input/multimodal/image_processors/:
  - QwenImageProcessor
  - Qwen3ImageProcessor
  - AdaptiveImageProcessor
  - PaddleOCRImageProcessor
- Unified the messages handling in fastdeploy/entrypoints/llm.py and fastdeploy/entrypoints/chat_utils.py: the messages -> prompt / multimodal_data flow is consolidated, reducing the divergence between the text and multimodal paths (see the sketch after this list).
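To illustrate the unified dispatch described above, a hypothetical sketch (all names other than Processor and the processors listed are illustrative stand-ins; the real factory lives in fastdeploy/input/preprocess.py):

```python
from abc import ABC, abstractmethod

class Processor(ABC):
    """Unified entry point: text and multimodal requests share one API."""

    @abstractmethod
    def process_request(self, request: dict) -> dict: ...

class TextProcessor(Processor):
    # Hypothetical stand-in for the text path.
    def process_request(self, request: dict) -> dict:
        request.setdefault("multimodal_data", None)  # no media on this path
        return request

class VLProcessor(Processor):
    # Hypothetical stand-in for the MMProcessor subclasses listed above.
    def process_request(self, request: dict) -> dict:
        # The real processors extract images/videos from messages here.
        request.setdefault("multimodal_data", {"images": []})
        return request

def create_processor(is_multimodal: bool) -> Processor:
    # Sketch of the preprocess.py factory's role: pick one pipeline up front.
    return VLProcessor() if is_multimodal else TextProcessor()

print(create_processor(True).process_request({"prompt": "describe the image"}))
```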
Usage or Command

The related tests can be run with the following command:
Accuracy Tests
This PR focuses on the Processor architecture refactor and the reorganization of the multimodal input pipeline; it does not directly modify model forward computation or operator implementations.
Accuracy test results are therefore not provided for this change.
Checklist
- Run pre-commit before commit.
- If the PR targets the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.